Hands-on: Making sense of big data, machine learning, and modeling

Jameson Brennan and Hector Menendez

Department of Animal Science, South Dakota State University

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. The example below prints a statement and runs a quick computation.

#Quick R example
print('ASAS NANP Workshop Demo')
## [1] "ASAS NANP Workshop Demo"
#Quick R Example
apple=5
orange=10
apple+orange
## [1] 15

The objectives of this hands-on training are to: 1) introduce workshop participants to methods for streamlining data processing tasks in R; 2) demonstrate and provide examples of compiling large accelerometer datasets for determining daily livestock behavior; 3) introduce a suite of classification algorithms and validation testing approaches for classifying accelerometer training datasets; and 4) utilize model predictions to estimate and analyze daily behavior for beef cattle. The dataset used in this tutorial is a subset of data from Brennan et al. (2021).

Import libraries

Our first step in processing accelerometer data is to import the libraries we will use for the analysis. Each library contains a set of functions that can be used to process data. For example, the function mean() sums the values in a vector and divides by the number of observations. The code below checks whether the necessary packages are installed on your computer and, if not, installs them before loading.
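As a quick illustration (with made-up numbers) of what a library function is doing, mean() gives the same result as the manual sum-and-divide calculation described above:

```r
#Hypothetical example values
weights = c(450, 465, 472, 480)
#Manual calculation: sum divided by number of observations
sum(weights) / length(weights)
## [1] 466.75
#Built-in function gives the same result
mean(weights)
## [1] 466.75
```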

##if there is an error and a package or dependency needs to be updated un-comment the code below and replace 'vctrs' with package
#remove.packages('vctrs')
#install.packages('vctrs')

#Needed packages
  list.of.packages <- c("lubridate","ggplot2",'dplyr','randomForest','plotly','class','caret','MASS','knitr','markdown','rpart','rpart.plot','e1071')
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages)) install.packages(new.packages)
  library(lubridate)
  library(ggplot2)
  library(dplyr) 
  library(randomForest)
  library(plotly)
  library(caret)
  library(class)
  library(MASS)
  library(rpart)
  library(e1071)
  library(knitr)
  library(markdown)
  library(rpart.plot)

Loading in the data

The example dataset was generated using a Gulf Coast Data Concepts X16-mini accelerometer data logger set at 12 Hz and placed on a yearling steer in 2017. The data logger generates a file once 1,000,000 records have been recorded, which equals roughly 1 day of data. Included in this tutorial are 5 raw datasets, or about 5 days' worth of data. First we need to set the working directory to where the dataset is located. You can do this programmatically or by clicking Session -> Set Working Directory -> Choose Directory. We will load the dataset using the read.csv command with header set to FALSE, and then view the data.

#Set working directory
#setwd("~/Conferences/NANP_2022/Workshop")

#Load in the raw data file and view first 15 records
Accel_df=read.csv('DATA-021.csv',header=F)
head(Accel_df,n=15L)
##                  V1                             V2            V3
## 1            ;Title  http://www.gcdataconcepts.com      X16-mini
## 2          ;Version                           1113    Build date
## 3       ;Start_time                     2017-06-10  14:22:08.011
## 4      ;Temperature                        -999.00         deg C
## 5       ;SampleRate                             12            Hz
## 6         ;Deadband                              0        counts
## 7  ;DeadbandTimeout                              0           sec
## 8             ;Time                             Ax            Ay
## 9             0.045                            789         -1404
## 10            0.128                           1096         -1843
## 11            0.211                           1060         -1116
## 12            0.294                            719          -429
## 13            0.377                            749          -720
## 14            0.460                           1012         -1050
## 15            0.543                            971         -1088
##                     V4                   V5  V6
## 1   Analog Dev ADXL345                         
## 2          Jan  6 2016   SN:CCDC1016DEE0180    
## 3                                              
## 4                 Vbat                 4062  mv
## 5                                              
## 6                                              
## 7                                              
## 8                   Az                         
## 9                 1451                         
## 10                1685                         
## 11                1576                         
## 12                1701                         
## 13                1587                         
## 14                1028                         
## 15                1190

As you can see from the output, the file comes with a lot of header information that we do not need. Of interest to us are the time stamp and the X, Y, and Z data, which start on line 8. In addition, the start time is located on line 3 in columns V2 and V3. We will need to add our Time column (seconds elapsed) to this starting point to calculate the time stamp for each record.

#Extract start date and time and convert to date time object
start_time = paste(Accel_df[3,2],Accel_df[3,3])
start_time=as.POSIXct(start_time,format ="%Y-%m-%d %H:%M:%S")

#delete first 8 rows from dataframe
Accel_df=Accel_df[-c(1:8),]
#Use NULL to remove excess columns
Accel_df$V5=NULL
Accel_df$V6=NULL
#rename columns and convert to numeric
colnames(Accel_df)=c("Time","Ax","Ay","Az")
Accel_df$Time=as.numeric(Accel_df$Time) 
Accel_df$Ax=as.numeric(Accel_df$Ax)
Accel_df$Ay=as.numeric(Accel_df$Ay)
Accel_df$Az=as.numeric(Accel_df$Az)

#add start time to time and display cleaned up header dataframe
Accel_df$Time= start_time+Accel_df$Time
rownames(Accel_df) <- NULL
head(Accel_df)
##                  Time   Ax    Ay   Az
## 1 2017-06-10 14:22:08  789 -1404 1451
## 2 2017-06-10 14:22:08 1096 -1843 1685
## 3 2017-06-10 14:22:08 1060 -1116 1576
## 4 2017-06-10 14:22:08  719  -429 1701
## 5 2017-06-10 14:22:08  749  -720 1587
## 6 2017-06-10 14:22:08 1012 -1050 1028

Converting Units/Calculating Variables

Next we need to convert the data to g-force (g), the unit of measure for accelerometers. Per the manufacturer's user manual, the raw X, Y, and Z output must be divided by 2048 to convert to g. In addition, we want to calculate two other variables: movement intensity (MI) and signal magnitude area (SMA). Previous research has demonstrated that these combined variables are beneficial for classifying livestock behavior. The formulas for those are:

\[ MI = \sqrt { x^2 + y^2 + z^2} \] \[ SMA= \lvert x \rvert + \lvert y \rvert + \lvert z \rvert \]

#convert to g 
Accel_df$Ax=Accel_df$Ax/2048
Accel_df$Ay=Accel_df$Ay/2048
Accel_df$Az=Accel_df$Az/2048
#Calculate MI and SMA
Accel_df$MI=sqrt(Accel_df$Ax^2 + Accel_df$Ay^2 + Accel_df$Az^2)
Accel_df$SMA=abs(Accel_df$Ax) + abs(Accel_df$Ay) + abs(Accel_df$Az)

head(Accel_df)
##                  Time        Ax         Ay        Az        MI      SMA
## 1 2017-06-10 14:22:08 0.3852539 -0.6855469 0.7084961 1.0584714 1.779297
## 2 2017-06-10 14:22:08 0.5351562 -0.8999023 0.8227539 1.3315932 2.257812
## 3 2017-06-10 14:22:08 0.5175781 -0.5449219 0.7695312 1.0756418 1.832031
## 4 2017-06-10 14:22:08 0.3510742 -0.2094727 0.8305664 0.9257281 1.391113
## 5 2017-06-10 14:22:08 0.3657227 -0.3515625 0.7749023 0.9261873 1.492188
## 6 2017-06-10 14:22:08 0.4941406 -0.5126953 0.5019531 0.8711994 1.508789
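As a quick check of the MI and SMA formulas, plugging the first row of the output above into them reproduces the displayed values:

```r
#First converted record (values from row 1 of the head() output above)
x = 0.3852539; y = -0.6855469; z = 0.7084961
#MI: square root of the sum of squares
sqrt(x^2 + y^2 + z^2)
## [1] 1.058471
#SMA: sum of absolute values
abs(x) + abs(y) + abs(z)
## [1] 1.779297
```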

Aggregate Data

Creating time windows can reduce noise from the raw accelerometer output and help with classification. To accomplish this, we will round each time stamp up to the nearest 5-second interval; you can change the interval based on the desired level of aggregation (González et al., 2015). From this rounded time we will compute the mean, minimum, maximum, and standard error of our X, Y, Z, MI, and SMA variables for every 5-second window. We will then merge these calculated time-window measurements into a single data frame.

#round time to 5 second intervals
Accel_df$Time=lubridate::ceiling_date(Accel_df$Time,unit = "5 seconds") 
#González, Luciano A., et al. "Behavioral classification of data from collars containing motion sensors in grazing cattle." Computers and Electronics in Agriculture 110 (2015): 91-102.
#Create Standard Error Function
standard_error <- function(x) sd(x) / sqrt(length(x))
#Calculate Mean, Max, Min, and SE for each 5 second time stamp
Accel_mean=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=mean)
colnames(Accel_mean)=c("Time","X_Mean","Y_Mean","Z_Mean", "MI_Mean","SMA_Mean")
Accel_min=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=min)
colnames(Accel_min)=c("Time","X_Min","Y_Min","Z_Min","MI_Min","SMA_Min")
Accel_max=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=max)
colnames(Accel_max)=c("Time","X_Max","Y_Max","Z_Max","MI_Max","SMA_Max")
Accel_SE=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=standard_error )
colnames(Accel_SE)=c("Time","X_SE","Y_SE","Z_SE","MI_SE","SMA_SE")

#Combine into one dataframe

Accel_df=list(Accel_mean,Accel_max,Accel_min,Accel_SE)
Accel_df=Reduce(function(x, y) merge(x, y, all=TRUE), Accel_df)
head(Accel_df)
##                  Time    X_Mean     Y_Mean    Z_Mean  MI_Mean SMA_Mean
## 1 2017-06-10 14:22:10 0.4301961 -0.5261841 0.7205200 1.009853 1.676900
## 2 2017-06-10 14:22:15 0.4179036 -0.5146240 0.7266846 1.002592 1.659212
## 3 2017-06-10 14:22:20 0.3803467 -0.5254639 0.7467855 1.004776 1.652596
## 4 2017-06-10 14:22:25 0.4224513 -0.5035781 0.7137071 0.986538 1.639736
## 5 2017-06-10 14:22:30 0.3697428 -0.5133952 0.7609701 1.002796 1.644108
## 6 2017-06-10 14:22:35 0.4598307 -0.4775228 0.7189290 1.002290 1.656283
##       X_Max      Y_Max     Z_Max   MI_Max  SMA_Max     X_Min      Y_Min
## 1 0.7519531 -0.2094727 1.0322266 1.335298 2.282715 0.2153320 -0.8999023
## 2 0.7592773 -0.1806641 1.0336914 1.364981 2.311523 0.1557617 -0.8886719
## 3 0.6918945 -0.1572266 1.1235352 1.414516 2.330078 0.2045898 -0.9643555
## 4 0.6606445 -0.1582031 0.9902344 1.305036 2.239258 0.2246094 -0.8476562
## 5 0.5810547 -0.2685547 1.1215820 1.381742 2.234375 0.2109375 -0.7763672
## 6 0.8901367 -0.1738281 1.0405273 1.373862 2.300293 0.2275391 -0.9233398
##       Z_Min    MI_Min  SMA_Min       X_SE       Y_SE       Z_SE      MI_SE
## 1 0.4555664 0.8391981 1.348633 0.02507613 0.03454523 0.03050527 0.03270258
## 2 0.4599609 0.8359342 1.314941 0.01742707 0.01874789 0.01774744 0.01822701
## 3 0.4584961 0.7883861 1.252441 0.01233284 0.01987365 0.01925812 0.01976474
## 4 0.4262695 0.7809516 1.207520 0.01507468 0.01756646 0.01633678 0.01650232
## 5 0.3808594 0.6734779 1.093262 0.01211075 0.01572578 0.02093040 0.01967968
## 6 0.3720703 0.8271793 1.252930 0.01891206 0.02135721 0.01849351 0.01838327
##       SMA_SE
## 1 0.05946504
## 2 0.03348271
## 3 0.03364648
## 4 0.03047485
## 5 0.03114984
## 6 0.03254150

Now that we have aggregated our data into 5-second intervals, we have reduced the data size from 1,000,000 records to 16,457 records, which is much more manageable. We can also start to plot the data to see if there are any obvious activity patterns throughout the day.
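As a rough sanity check on that reduction, at a nominal 12 Hz a 5 second window holds about 60 records, so 1,000,000 records should collapse to roughly 16,700 rows; the slightly lower observed count (16,457) reflects the logger's actual sample spacing being a bit longer than 1/12 second.

```r
#Nominal records per 5 second window at 12 Hz
12 * 5
## [1] 60
#Approximate number of aggregated rows per raw file
1000000 / 60
## [1] 16666.67
```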

ggplot2::ggplot(Accel_df,aes(x=Time,y=MI_Mean))+
  geom_line(color='Red')+
  ylab("MI Mean")+
  ggtitle('Movement Intensity Five Second Mean')

ggplot2::ggplot(Accel_df)+
  geom_line(aes(x=Time,y=X_Max),color='red')+
  geom_line(aes(x=Time,y=Y_Max),color='blue')+
  geom_line(aes(x=Time,y=Z_Max),color='black')+
  ylab("g")+
  ggtitle('X, Y, and Z Maximum Values')

Simplifying the code into a function

Numerous steps were required to process the dataset into a usable format that can be incorporated into machine learning models. In R, we can use existing functions to help process and analyze our data, but we can also write our own functions. This can help simplify code and streamline data processing. Below we will wrap all of the previous data processing steps into a single custom function. The function is named Accel_function and takes a single argument, datafile, the name of the raw file to process.

Accel_function=function (datafile){
  #Load in the raw data file and view first 15 records
  Accel_df=read.csv(datafile,header=F)
  start_time = paste(Accel_df[3,2],Accel_df[3,3])
  start_time=as.POSIXct(start_time,format ="%Y-%m-%d %H:%M:%S")
  
  #delete first 8 rows from dataframe and remove unneeded blank rows
  Accel_df=Accel_df[-c(1:8),]
  Accel_df$V5=NULL
  Accel_df$V6=NULL
  #rename columns and convert to numeric
  colnames(Accel_df)=c("Time","Ax","Ay","Az")
  Accel_df$Time=as.numeric(as.character(Accel_df$Time))
  Accel_df$Ax=as.numeric(as.character(Accel_df$Ax))
  Accel_df$Ay=as.numeric(as.character(Accel_df$Ay))
  Accel_df$Az=as.numeric(as.character(Accel_df$Az))
  
  #add start time to time and display cleaned up header dataframe
  Accel_df$Time= start_time+Accel_df$Time
  rownames(Accel_df) <- NULL
  
  #convert to g
  Accel_df$Ax=Accel_df$Ax/2048
  Accel_df$Ay=Accel_df$Ay/2048
  Accel_df$Az=Accel_df$Az/2048
  #Calculate MI and SMA
  Accel_df$MI=sqrt(Accel_df$Ax^2 + Accel_df$Ay^2 + Accel_df$Az^2)
  Accel_df$SMA=abs(Accel_df$Ax) + abs(Accel_df$Ay) + abs(Accel_df$Az)
  
  #round time to 5 second intervals
  Accel_df$Time=lubridate::ceiling_date(Accel_df$Time,unit = "5 seconds")
  standard_error <- function(x) sd(x) / sqrt(length(x))
  #Calculate Mean, Max, Min, and SE for each 5 second time stamp
  Accel_mean=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=mean)
  colnames(Accel_mean)=c("Time","X_Mean","Y_Mean","Z_Mean", "MI_Mean","SMA_Mean")
  Accel_min=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=min)
  colnames(Accel_min)=c("Time","X_Min","Y_Min","Z_Min","MI_Min","SMA_Min")
  Accel_max=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=max)
  colnames(Accel_max)=c("Time","X_Max","Y_Max","Z_Max","MI_Max","SMA_Max")
  Accel_SE=aggregate(Accel_df[,2:6],list(Accel_df$Time),FUN=standard_error )
  colnames(Accel_SE)=c("Time","X_SE","Y_SE","Z_SE","MI_SE","SMA_SE")
  
  #Combine into one dataframe
  
  Accel_df=list(Accel_mean,Accel_max,Accel_min,Accel_SE)
  Accel_df=Reduce(function(x, y) merge(x, y, all=TRUE), Accel_df)
  
  return(Accel_df)
  
  
}

Using the function

We can now call our function and process a datafile with a single line of code and display the output.

Accel_data=Accel_function('DATA-022.csv')
head(Accel_data)
##                  Time    X_Mean     Y_Mean    Z_Mean   MI_Mean SMA_Mean
## 1 2017-06-11 13:15:20 0.3492635 -0.4703878 0.7804871 0.9853950 1.600138
## 2 2017-06-11 13:15:25 0.5361654 -0.3397054 0.7276042 0.9803144 1.603475
## 3 2017-06-11 13:15:30 0.6746971 -0.2567697 0.6084895 1.0119527 1.580492
## 4 2017-06-11 13:15:35 0.5690267 -0.4343913 0.6648600 1.0152054 1.668278
## 5 2017-06-11 13:15:40 0.5214844 -0.4165853 0.7055583 0.9847082 1.643628
## 6 2017-06-11 13:15:45 0.4854818 -0.4696533 0.7023600 0.9894400 1.657495
##       X_Max       Y_Max     Z_Max   MI_Max  SMA_Max       X_Min      Y_Min
## 1 0.6884766 -0.24951172 0.9526367 1.240585 2.074707 0.200195312 -0.7568359
## 2 0.7416992 -0.15820312 0.9106445 1.100161 1.818359 0.115722656 -0.8256836
## 3 1.2934570  0.32861328 1.4755859 1.889156 3.264160 0.003417969 -1.0454102
## 4 1.3168945 -0.02441406 1.0747070 1.648346 2.719727 0.003906250 -0.9428711
## 5 0.7270508 -0.21191406 1.1723633 1.423899 2.313477 0.182617188 -0.7153320
## 6 0.9204102 -0.13183594 0.8925781 1.177397 1.958008 0.325683594 -0.8354492
##         Z_Min    MI_Min   SMA_Min       X_SE       Y_SE        Z_SE       MI_SE
## 1  0.60937500 0.8494963 1.2895508 0.01395528 0.01529155 0.010543921 0.012006165
## 2  0.58984375 0.9056680 1.4438477 0.01577842 0.01438799 0.007520017 0.005047960
## 3 -0.03173828 0.2503326 0.3183594 0.03613918 0.03193671 0.039869168 0.040356071
## 4  0.35302734 0.5414528 0.7905273 0.02854079 0.02475697 0.019483584 0.022737444
## 5  0.47998047 0.6477652 1.0717773 0.01509846 0.01312463 0.014251036 0.012519064
## 6  0.56201172 0.8579398 1.4350586 0.01626596 0.01469610 0.008986108 0.007921131
##        SMA_SE
## 1 0.022556001
## 2 0.009157867
## 3 0.067388964
## 4 0.038089568
## 5 0.020227060
## 6 0.012894330

Applying our function to a list of files

In this example, there are 5 data files from one animal that need to be processed and merged together. You could run the function five separate times and merge the results. However, with a full accelerometer dataset of 90+ files per animal, this would become a labor-intensive process. To further automate the process, we will extract the names of all the files in the directory that match a pattern, in this case ‘DATA-’, and apply our function to each of these files. We will then bind the rows together to make one dataset from all the files.

#extract the names of all files that match the string 'DATA-'
filenames=list.files(getwd(),pattern = "DATA-",all.files = FALSE)
#Process the list of files to create a data frame with all five data files merged together
Accel_Merged <- dplyr::bind_rows(lapply(filenames[1:length(filenames)], Accel_function))
#Print Dimensions of Data
dim(Accel_Merged)
## [1] 82281    21

Using the dim (dimensions) function, we see that we now have a data frame with 82,281 rows and 21 columns. Though it took a few minutes to run, we processed 5 million raw accelerometer records split across 5 files with just a few lines of code. The plot below shows the signal magnitude area standard error calculated every five seconds over a five-day period.

ggplot2::ggplot(Accel_Merged,aes(x=Time,y=SMA_SE))+
  geom_line(color='Red')+
  ylab("SMA Standard Error")+
  ggtitle('SMA Standard Error by Five Second Window')

Training dataset

Now that we have demonstrated how to process the raw accelerometer files, we will move on to the training dataset for our machine learning models. This dataset consists of 11,478 labeled 5-second accelerometer observations. Visual observations of individual animals in the field were used to label the data. The first thing we will do is load the training dataset, look at the column names, and count the number of observations for each behavior using the ‘table’ function.

observed_df=read.csv('Model_Training_Data.csv')
#Set Behavior as factor
observed_df$Behavior=as.factor(observed_df$Behavior)
#print column names
colnames(observed_df)
##  [1] "Time"     "X_Mean"   "Y_Mean"   "Z_Mean"   "MI_Mean"  "SMA_Mean"
##  [7] "X_Max"    "Y_Max"    "Z_Max"    "MI_Max"   "SMA_Max"  "X_Min"   
## [13] "Y_Min"    "Z_Min"    "MI_Min"   "SMA_Min"  "MI_SE"    "SMA_SE"  
## [19] "X_SE"     "Y_SE"     "Z_SE"     "Behavior"
#print count of each behavior
table(observed_df$Behavior)
## 
##    G    R    W 
## 6738 4520  220

From the output, we can see that the dataset contains the same variables as our processed dataset, plus an additional column labeled ‘Behavior’. The output from the ‘table’ function shows that there are 6,738 G (grazing) observations, 4,520 R (resting) observations, and 220 W (walking) observations.

One of the first steps in analyzing data is to visualize it with a number of different plots. This can help us understand data distributions, challenges that may exist in the modeling process, and sources of error.

ggplot(observed_df, aes(x=X_Mean,color=Behavior)) + 
  geom_histogram()+
  facet_wrap(~Behavior)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram, we can see that grazing counts decline sharply between X mean values of 0 and 0.25, while resting counts increase over the same range. There is some overlap between the behaviors that may be a source of model error.

ggplot(observed_df, aes(x=MI_SE,color=Behavior)) + 
  geom_density(lwd=1.5)

The density plot shows that the grazing and walking behaviors are shifted toward higher standard error values relative to the resting behavior.

ggplot(observed_df,aes(x=MI_SE,y=X_SE,color=Behavior))+
  geom_point()+
  xlim(0,0.05)+
  xlab("MI Standard Error")+
  ylab('X Standard Error')
## Warning: Removed 25 rows containing missing values (geom_point).

From our scatter plot we can see good separation between grazing and resting behavior. We can even view the plots in 3D to help visualize the separation of behaviors in space.

plotly:: plot_ly(observed_df, x=~SMA_SE, y=~X_Mean, 
        z=~MI_Min, color=~Behavior)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Split Dataset into train/test

For this example we are going to use two different validation approaches for assessing our model accuracy.

The validation set approach (VSA) divides the observed dataset into a training and a testing dataset. The training dataset is used to fit the different models, which are then used to predict behavior on the test dataset to obtain an unbiased estimate of model accuracy.

The second approach is cross-validation (CV). CV splits the data into a number of groups (folds). The model is fit iteratively, with each fold held out once for testing, and the average error rate across all folds is reported. For many machine learning applications, 5- or 10-fold cross-validation is commonly used; for our example, we will use 5. Increasing the number of folds increases the computational time of the analysis.
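To make the fold assignment concrete, the sketch below (base R only, with a hypothetical 10-observation dataset) shows how 5-fold CV places every observation in exactly one held-out fold, so each row is used for testing exactly once:

```r
set.seed(314)
#hypothetical number of observations
n = 10
#assign each row to one of 5 folds
folds = sample(rep(1:5, length.out = n))
for (i in 1:5) {
  test.index  = which(folds == i) #rows held out on this iteration
  train.index = which(folds != i) #rows used to fit the model
  #fit the model on train.index, evaluate on test.index, store the error
}
```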

Before we split our dataset, we want to use the ‘set.seed’ function. The data will still be split randomly into training and testing sets, but setting the seed makes our results reproducible because the same random sample is drawn each time. In the code below, we split our dataset 80/20 into training and testing data.

#setting seed allows us to reproduce the exact results
set.seed(314)
#This example will do a 80/20 train/test. You can change the 0.8 to alter this ratio
train_data_index <- sample(1:nrow(observed_df), 0.8 * nrow(observed_df))
test_data_index <- setdiff(1:nrow(observed_df), train_data_index)

# Build train and test datset
train_data <- observed_df[train_data_index,]
test_data <- observed_df[test_data_index, ]

KNN

The first method we will test is k-nearest neighbor (KNN), which classifies each new observation by a majority vote of the behaviors of its k closest points in the training data. We will fit a KNN model, apply it to the test dataset, and compare the observed versus predicted values to assess model accuracy.
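As a toy illustration of the majority-vote idea (using R's built-in iris data rather than our accelerometer data), knn() from the class package labels a held-out flower with the most common species among its k = 3 nearest training points:

```r
#Hold out the first flower and classify it from its 3 nearest neighbors
train.toy = iris[-1, 1:4]
test.toy  = iris[1, 1:4]
class::knn(train.toy, test.toy, cl = iris$Species[-1], k = 3)
## [1] setosa
## Levels: setosa versicolor virginica
```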

#KNN VSA method 
#create train dataset with only predictors, columns 2-21 are our predictor variables (from the accelerometer)
train.knn=train_data[,2:21]
#test dataset only predictors
test.knn=test_data[,2:21]
#fit KNN model with k = 3 nearest neighbors, assign prediction to test_data column KNN
test_data$KNN=knn(train.knn,test.knn,train_data$Behavior,k=3)
#compare prediction with observed on test dataset
caret::confusionMatrix(test_data$KNN,test_data$Behavior) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    G    R    W
##          G 1277   81   19
##          R   65  809    6
##          W   14    3   22
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9181         
##                  95% CI : (0.9061, 0.929)
##     No Information Rate : 0.5906         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.835          
##                                          
##  Mcnemar's Test P-Value : 0.3193         
## 
## Statistics by Class:
## 
##                      Class: G Class: R Class: W
## Sensitivity            0.9417   0.9059 0.468085
## Specificity            0.8936   0.9494 0.992441
## Pos Pred Value         0.9274   0.9193 0.564103
## Neg Pred Value         0.9140   0.9407 0.988923
## Prevalence             0.5906   0.3889 0.020470
## Detection Rate         0.5562   0.3524 0.009582
## Detection Prevalence   0.5997   0.3833 0.016986
## Balanced Accuracy      0.9177   0.9277 0.730263
#store accuracy for table later on
knn.vsa=as.numeric(caret::confusionMatrix(test_data$KNN,test_data$Behavior)$overall[1])

Next we will fit a KNN model using the cross-validation approach. Note that this approach uses the full observed dataset. For each of the 5 iterations, a KNN model is fit on 80% of the data and tested on the remaining 20%, and the accuracy reported is the average across those 5 iterations.

#KNN 5 fold Cross validation
set.seed(314)
#Train model using 5 fold cv
knn.cv=train(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
              tuneGrid=expand.grid(k=3),
              method='knn',
              trControl=trainControl(method = "cv",number=5), #change number to change number of folds
              metric="Accuracy",
              data = observed_df)
knn.cv
## k-Nearest Neighbors 
## 
## 11478 samples
##    20 predictor
##     3 classes: 'G', 'R', 'W' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9182, 9182, 9183, 9183, 9182 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9214148  0.8409239
## 
## Tuning parameter 'k' was held constant at a value of 3
#store accuracy for later
knn.cv=knn.cv$results$Accuracy

Linear Discriminant Analysis

Our next method to test is linear discriminant analysis (LDA).

#####LDA VSA Approach

lda.vsa=lda(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max               + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            data=train_data)
#predict behavior on test dataset using the model
test_data$LDA_VSA=predict(lda.vsa,test_data,type="response")$class
caret::confusionMatrix(test_data$LDA_VSA,test_data$Behavior)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    G    R    W
##          G 1289   85   36
##          R   64  808   10
##          W    3    0    1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9138          
##                  95% CI : (0.9015, 0.9249)
##     No Information Rate : 0.5906          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8232          
##                                           
##  Mcnemar's Test P-Value : 6.924e-09       
## 
## Statistics by Class:
## 
##                      Class: G Class: R  Class: W
## Sensitivity            0.9506   0.9048 0.0212766
## Specificity            0.8713   0.9473 0.9986661
## Pos Pred Value         0.9142   0.9161 0.2500000
## Neg Pred Value         0.9244   0.9399 0.9799302
## Prevalence             0.5906   0.3889 0.0204704
## Detection Rate         0.5614   0.3519 0.0004355
## Detection Prevalence   0.6141   0.3841 0.0017422
## Balanced Accuracy      0.9109   0.9260 0.5099713
#store model accuracy
LDA_VSA=as.numeric(caret::confusionMatrix(test_data$LDA_VSA,test_data$Behavior)$overall[1])

We can also generate stacked histograms of our predictions.

ldahist(test_data$X_Mean,test_data$LDA_VSA)

To run the CV on LDA we will write our own function.

#We need to create a cross validation function. This will loop through each k fold, fit the model, and store and average the error rates

cv.lda <-function (data, model=origin~., yname="origin", K=5, seed=314) {
    n <- nrow(data)
    set.seed(seed)
    datay=data[,yname] #response variable
    #partition the data into K subsets
    f <- ceiling(n/K)
    s <- sample(rep(1:K, f), n)  
    #generate fold indices 1:K and sample n of them
    # K fold cross-validated error
    CV=NULL
    for (i in 1:K) {
      test.index <- seq_len(n)[(s == i)] #test data
      train.index <- seq_len(n)[(s != i)] #training data
      
      #model with training data
      lda.fit=lda(model, data=data[train.index,])
      #observed test set y
      lda.y <- data[test.index, yname]
      #predicted test set y
      lda.predy=predict(lda.fit, data[test.index,])$class
      
      #observed - predicted on test data
      error= mean(lda.y!=lda.predy)
      #error rates 
      CV=c(CV,error)
    }
    #Output
    list(call = model, K = K,error=CV, 
         lda_error_rate = mean(CV), seed = seed)  
  }

#Use our function to run the CV
lda.kfold=cv.lda(data=observed_df,
                 model = Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max +  MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
                 yname="Behavior",
                 K=5,
                 seed = 314)
#Show output and store accuracy
lda.kfold
## $call
## Behavior ~ X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + 
##     Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + 
##     MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
## 
## $K
## [1] 5
## 
## $error
## [1] 0.08972125 0.08667247 0.09277003 0.07360627 0.08761988
## 
## $lda_error_rate
## [1] 0.08607798
## 
## $seed
## [1] 314
lda.cv=1-lda.kfold$lda_error_rate

Random Forest

Random forest is a machine learning method that uses recursive partitioning (decision trees) to build a model to break our data down into a series of splits resulting in a prediction. A simple example from our dataset can be seen below.

#Example of recursive partitioning
rpart_vsa=rpart(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max +    SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
                 ,data=observed_df,control = rpart.control(minsplit = 10))

#Plot the decision tree
rpart.plot:: rpart.plot(rpart_vsa)

Though these single-tree models are useful for interpretation, more complex tree-based models often perform better. Random forest uses a similar process, except it fits a large number of trees, each built from a bootstrapped sample of the data with a random subset of features considered at each split.

#Set Seed
set.seed(314)
#Fit Random Forest Model
rf_vsa=randomForest( Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max +                      SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE 
                    ,ntree=1000,data=train_data)
#Predict Behavior on test dataset
test_data$rf_vsa=predict(rf_vsa,newdata=test_data)
#Generate confusion matrix and save accuracy
caret::confusionMatrix(test_data$rf_vsa,test_data$Behavior)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    G    R    W
##          G 1301   41   24
##          R   54  852    9
##          W    1    0   14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9438          
##                  95% CI : (0.9336, 0.9529)
##     No Information Rate : 0.5906          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8861          
##                                           
##  Mcnemar's Test P-Value : 5.391e-07       
## 
## Statistics by Class:
## 
##                      Class: G Class: R Class: W
## Sensitivity            0.9594   0.9541 0.297872
## Specificity            0.9309   0.9551 0.999555
## Pos Pred Value         0.9524   0.9311 0.933333
## Neg Pred Value         0.9409   0.9703 0.985533
## Prevalence             0.5906   0.3889 0.020470
## Detection Rate         0.5666   0.3711 0.006098
## Detection Prevalence   0.5949   0.3985 0.006533
## Balanced Accuracy      0.9451   0.9546 0.648714
vsa_rf=as.numeric(caret::confusionMatrix(test_data$rf_vsa,test_data$Behavior)$overall[1])

We can also look at which variables are most important in the model.

varImpPlot(rf_vsa,main="Variable Importance Plot RF Model")

Again, for the RF model we will use our own cross validation function to test model accuracy.
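The fold-assignment step used inside this function can be illustrated on its own; the small n below is hypothetical:

```r
#How the fold labels inside the CV function are built: repeat fold labels
#1..K enough times to cover n rows, then shuffle so each row gets a random fold
set.seed(314)
n <- 23; K <- 5
f <- ceiling(n / K)          #labels needed per fold to cover n rows
s <- sample(rep(1:K, f), n)  #shuffled fold label for each row
table(s)                     #folds come out nearly equal in size
```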

rf.cv=function (data, model=origin~., yname="origin", K=5, seed=314) {
  n <- nrow(data)
  set.seed(seed)
  #partition the data into K roughly equal folds
  f <- ceiling(n/K)
  s <- sample(rep(1:K, f), n)  #randomly assign a fold label 1:K to each row
  # K fold cross-validated error
  
  CV=NULL
  for (i in 1:K) {
    test.index <- seq_len(n)[(s == i)] #test data
    train.index <- seq_len(n)[(s != i)] #training data
    
    #model with training data
    rf.fit=randomForest(model, data=data[train.index,])
    #observed test set y
    rf.y <- data[test.index, yname]
    #predicted test set y
    rf.predy=predict(rf.fit, data[test.index,])
    
    #misclassification rate on the test fold
    error= mean(rf.y!=rf.predy)
    #collect error rates 
    CV=c(CV,error)
  }
  #Output
  list(call = model, K = K,error=CV, 
       rf_error_rate = mean(CV), seed = seed)  
}
#Run function for cross validation using random forest
cv_rf=rf.cv(data = observed_df,
            model =  Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max +                      SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            yname="Behavior",
            K=5,
            seed = 314)
#Cross validation output
cv_rf
## $call
## Behavior ~ X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + 
##     Y_Max + Z_Max + MI_Max + SMA_Max + X_Min + Y_Min + Z_Min + 
##     MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE
## 
## $K
## [1] 5
## 
## $error
## [1] 0.05923345 0.05792683 0.05487805 0.04355401 0.05100262
## 
## $rf_error_rate
## [1] 0.05331899
## 
## $seed
## [1] 314
#store accuracy
cv_rf=1-cv_rf$rf_error_rate

Support vector machine

The next model we will fit is a support vector machine (SVM), again using the validation set approach. Note that the svm() function comes from the e1071 package.

#Fit SVM model 
svm_mod=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                     + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            data = train_data,kernel='linear')
#Use model to predict on test dataset
test_data$SVM_VSA=predict(svm_mod,test_data)
#Calculate confusion matrix and save accuracy
svm_vsa=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_VSA)$overall[1])
caret::confusionMatrix(test_data$Behavior,test_data$SVM_VSA)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    G    R    W
##          G 1285   71    0
##          R   71  822    0
##          W   39    8    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.9057, 0.9286)
##     No Information Rate : 0.6076          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8315          
##                                           
##  Mcnemar's Test P-Value : 3.476e-10       
## 
## Statistics by Class:
## 
##                      Class: G Class: R Class: W
## Sensitivity            0.9211   0.9123       NA
## Specificity            0.9212   0.9491  0.97953
## Pos Pred Value         0.9476   0.9205       NA
## Neg Pred Value         0.8830   0.9437       NA
## Prevalence             0.6076   0.3924  0.00000
## Detection Rate         0.5597   0.3580  0.00000
## Detection Prevalence   0.5906   0.3889  0.02047
## Balanced Accuracy      0.9212   0.9307       NA

Next, we evaluate the SVM using 5-fold cross validation.

svm.cv=function (data, model=origin~., yname="origin", K=5, seed=314) {
  n <- nrow(data)
  set.seed(seed)
  #partition the data into K roughly equal folds
  f <- ceiling(n/K)
  s <- sample(rep(1:K, f), n)  #randomly assign a fold label 1:K to each row
  # K fold cross-validated error
  
  CV=NULL
  for (i in 1:K) {
    test.index <- seq_len(n)[(s == i)] #test data
    train.index <- seq_len(n)[(s != i)] #training data
    
    #model with training data
    svm.fit=svm(model, data=data[train.index,],kernel='linear')
    #observed test set y
    svm.y <- data[test.index, yname]
    #predicted test set y
    svm.predy=predict(svm.fit, data[test.index,])
    
    #misclassification rate on the test fold
    error= mean(svm.y!=svm.predy)
    #collect error rates 
    CV=c(CV,error)
  }
  #Output
  list(call = model, K = K,error=CV, 
       svm_error_rate = mean(CV), seed = seed)  
}
#Run function for cross validation using support vector machine
cv_svm=svm.cv(data = observed_df,
            model =  Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max +                      SMA_Max + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            yname="Behavior",
            K=5,
            seed = 314)
#store accuracy
cv_svm=1-cv_svm$svm_error_rate
#Cross validation output
cv_svm
## [1] 0.9174071

Display Table of Results

Now that we have run our selected machine learning models, we can display the accuracy of each model and testing scheme in a table. Note that although we ran four models above, there are many more models that could be run and compared for accuracy.

#create dataframe of accuracy and models
final_table=as.data.frame(rbind(c(knn.vsa*100,knn.cv*100),c(LDA_VSA*100,lda.cv*100),c(vsa_rf*100,cv_rf*100),c(svm_vsa*100,cv_svm*100)))
final_table$Model=c("KNN","LDA","RF","SVM")
colnames(final_table)=c("VSA Accuracy","5-Fold CV Accuracy","Model")
final_table<- final_table[, c(3,1,2)]

rownames(final_table)=NULL


knitr::kable(final_table,digits=1,caption="Model Accuracy (%) for Validation and CV Approaches")
Model Accuracy (%) for Validation and CV Approaches

Model   VSA Accuracy   5-Fold CV Accuracy
-----   ------------   ------------------
KNN             91.8                 92.1
LDA             91.4                 91.4
RF              94.4                 94.7
SVM             91.8                 91.7

Parameter tuning

For the training examples above, we ran each model with its default settings and evaluated its accuracy. For most machine learning models, there are additional parameters that can be adjusted that may improve model accuracy. For example, for KNN we used 3 neighbors to make predictions, but we could also evaluate which number of neighbors provides the optimal solution. In the example below, we will use a grid search to test the cross-validation accuracy of k = 1, 3, 5, and 7 neighbors.
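The core of a grid search can be sketched in base R: evaluate every candidate parameter value and keep the best. The accuracy values below are illustrative numbers standing in for cross-validation results:

```r
#Minimal grid-search sketch: score each candidate k, keep the best
grid <- expand.grid(k = c(1, 3, 5, 7))
grid$Accuracy <- c(0.9074, 0.9212, 0.9229, 0.9230)  #illustrative CV accuracies
best_k <- grid$k[which.max(grid$Accuracy)]
best_k
```

caret's train() automates exactly this loop, fitting and cross-validating one model per row of tuneGrid.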

#KNN 5-fold cross validation
#library(caret) #already loaded above
set.seed(314)
#Train model using 10 fold cv
knn.cv=train(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                  + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
              tuneGrid=expand.grid(k=c(1,3,5,7)),
              method='knn',
              trControl=trainControl(method = "cv",number=5), #change number to change number of folds
              metric="Accuracy",
              data = observed_df)
knn.cv
## k-Nearest Neighbors 
## 
## 11478 samples
##    20 predictor
##     3 classes: 'G', 'R', 'W' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 9182, 9182, 9183, 9183, 9182 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   1  0.9073881  0.8143589
##   3  0.9211532  0.8403522
##   5  0.9228961  0.8432601
##   7  0.9229832  0.8430419
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
#store the accuracy for each value of k for later
knn.cv=knn.cv$results$Accuracy

We can see from the output that the model with k=7 neighbors provided the best results.

Other models such as support vector machines have similar parameters that can be tuned. For example, below we use cross validation to tune the cost parameter and evaluate model accuracy.

set.seed(314)

tune_svm=tune(svm,Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                  + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
              data = observed_df,kernel='linear',ranges = list(cost=c(.01,10)))
summary(tune_svm)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.08250526 
## 
## - Detailed performance results:
##    cost      error  dispersion
## 1  0.01 0.08442149 0.006576775
## 2 10.00 0.08250526 0.006755321

We can also adjust the kernel type to see if a linear, polynomial, or radial kernel provides better results.

#Fit SVM model 
svm_lin=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                     + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            data = train_data,kernel='linear')
#Use model to predict on test dataset
test_data$SVM_lin=predict(svm_lin,test_data)
#Calculate confusion matrix and save accuracy
svm_lin=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_lin)$overall[1])

#Fit Radial Model
svm_rad=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                     + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            data = train_data,kernel='radial')
#Use model to predict on test dataset
test_data$SVM_rad=predict(svm_rad,test_data)
#Calculate confusion matrix and save accuracy
svm_rad=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_rad)$overall[1])

#Fit Polynomial Model
svm_poly=svm(Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max                     + X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE,
            data = train_data,kernel='polynomial')
#Use model to predict on test dataset
test_data$SVM_poly=predict(svm_poly,test_data)
#Calculate confusion matrix and save accuracy
svm_poly=as.numeric(caret::confusionMatrix(test_data$Behavior,test_data$SVM_poly)$overall[1])

#Print accuracy of each
print(paste('Linear:',svm_lin,',', "Radial:",svm_rad,',', "Polynomial:",svm_poly))
## [1] "Linear: 0.917682926829268 , Radial: 0.937282229965157 , Polynomial: 0.930749128919861"

We can see from the output that the radial kernel had the highest accuracy of the three. By adjusting the parameters of the model we may further enhance our predictive capabilities. For now we will continue with our random forest model, which had the highest overall accuracy.

Model Selection for Deployment

Based on the table above, the random forest model had the highest accuracy of all the models, so we will select the RF model for predicting all of our unobserved accelerometer data. To do so, we now fit a RF model as above, but using all of the data available to us. Since all available data are used to fit the model, we cannot compute an unbiased accuracy on a held-out test set. However, the output below reports the out-of-bag (OOB) error estimate, which is computed for each observation using only the trees that did not include it in their bootstrap sample. It is reassuring that this OOB error is close to our testing error.

set.seed(314)
rf_deploy=randomForest( Behavior~X_Mean + Y_Mean + Z_Mean + MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max + 
                   X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE + X_SE + Y_SE + Z_SE 
                            ,ntree=1000,data=observed_df)
rf_deploy
## 
## Call:
##  randomForest(formula = Behavior ~ X_Mean + Y_Mean + Z_Mean +      MI_Mean + SMA_Mean + X_Max + Y_Max + Z_Max + MI_Max + SMA_Max +      X_Min + Y_Min + Z_Min + MI_Min + SMA_Min + MI_SE + SMA_SE +      X_SE + Y_SE + Z_SE, data = observed_df, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.07%
## Confusion matrix:
##      G    R  W class.error
## G 6535  195  8  0.03012763
## R  218 4302  0  0.04823009
## W  121   40 59  0.73181818
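The OOB estimate works because each tree's bootstrap sample leaves out roughly 1/e (about 36.8%) of the rows, and those left-out rows act as a built-in test set for that tree. A quick base-R check of that fraction:

```r
#Fraction of rows a single bootstrap resample never draws (the "out-of-bag" rows)
set.seed(314)
n <- 10000
in_bag <- unique(sample(1:n, replace = TRUE))  #rows drawn into one bootstrap
oob_frac <- 1 - length(in_bag) / n             #fraction never drawn
oob_frac  #close to 1 - (1 - 1/n)^n, i.e. about 0.368
```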

Now that we have fit our final model, we want to use it to predict behavior on all unobserved accelerometer data. To do this we will use the ‘predict’ function with the inputs of our final model (rf_deploy) and the accelerometer dataset we had processed earlier from the 5 raw files.

Accel_Merged$Behavior=predict(rf_deploy,Accel_Merged)

Estimating Daily Behavior

Though our model had an accuracy of ~95%, it is important to make sure model predictions make biological sense. For instance, if our predictions on the accelerometer data indicated that an animal was resting 20 hours a day and grazing only between 1:00 and 5:00 in the morning, we might suspect that our model was doing a poor job predicting unobserved behavior. This a priori knowledge can come from experience or from visualizing the observed data in plots such as histograms. The code below constructs a dataframe of daily behavior estimates, in minutes, for resting, walking, and grazing behavior.
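Each behavior prediction corresponds to one 5-second accelerometer window, which is why the code below multiplies the daily counts by 5 and divides by 60 to get minutes. A quick sanity check with hypothetical daily window counts, which for a fully observed day should total 1440 minutes:

```r
#Hypothetical counts of 5-second prediction windows for one day
windows_per_day <- c(G = 7176, R = 8880, W = 1224)
#count * 5 seconds per window / 60 seconds per minute = minutes of behavior
minutes_per_day <- windows_per_day * 5 / 60
minutes_per_day
sum(minutes_per_day)  #a complete day of predictions totals 1440 minutes
```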

#convert date time to only date
  Accel_Merged$Date= as.Date(Accel_Merged$Time, format =  "%m/%d/%Y")

#subset out walk predictions, count the number per day and convert back to minutes
  df_pred_walk=subset(Accel_Merged,Behavior=='W')
  df_pred_walk=aggregate(df_pred_walk$Behavior,by=list(c(df_pred_walk$Date)),FUN=length)
  colnames(df_pred_walk)=c("Date","walk_Min")
  df_pred_walk$walk_Min=(df_pred_walk$walk_Min*5)/60
#subset out graze predictions, count the number per day and convert back to minutes
  df_pred_graze=subset(Accel_Merged,Behavior=='G')
  df_pred_graze=aggregate(df_pred_graze$Behavior,by=list(c(df_pred_graze$Date)),FUN=length)
  colnames(df_pred_graze)=c("Date","graze_Min")
  df_pred_graze$graze_Min=(df_pred_graze$graze_Min*5)/60

#subset out rest predictions, count the number per day and convert back to minutes
  df_pred_rest=subset(Accel_Merged,Behavior=='R')
  df_pred_rest=aggregate(df_pred_rest$Behavior,by=list(c(df_pred_rest$Date)),FUN=length)
  colnames(df_pred_rest)=c("Date","rest_Min")
  df_pred_rest$rest_Min=(df_pred_rest$rest_Min*5)/60
#Combine into one data frame column and calculate total
  df_Total=as.data.frame(cbind(as.character(df_pred_graze$Date),df_pred_graze$graze_Min,df_pred_rest$rest_Min,df_pred_walk$walk_Min))
  colnames(df_Total)=c("Date","Graze_Min","Rest_Min","Walk_Min")
  df_Total$Graze_Min=as.numeric(df_Total$Graze_Min)
  df_Total$Rest_Min=as.numeric(df_Total$Rest_Min)
  df_Total$Walk_Min=as.numeric(df_Total$Walk_Min)
  df_Total$Total_Minutes=df_Total$Graze_Min+df_Total$Rest_Min+df_Total$Walk_Min
  
#print table as output  
  knitr::kable(df_Total,digits=0,caption="Daily Behavior Estimates")
Daily Behavior Estimates

Date         Graze_Min   Rest_Min   Walk_Min   Total_Minutes
----------   ---------   --------   --------   -------------
2017-06-10          67        136         15             218
2017-06-11         598        740        102            1440
2017-06-12         526        827         87            1440
2017-06-13         595        735        110            1440
2017-06-14         587        716        136            1440
2017-06-15         276        565         38             879

Example application of data

Estimating daily behavior can be used in a number of applications. For example, we could look to see whether changes in normal daily behavior indicate sickness in animals, or we could use daily behavior to estimate the additional energy requirements of animals based on existing equations. The example below calculates Net Energy required for Activity based on the accelerometer data from our example.

df_NEm<-df_Total[2:5,] #keep only complete 24-hour days

#### CONVERT REST MINUTES TO HOURS

df_NEm$Rest_Min=df_NEm$Rest_Min/60 #note: the Rest_Min column holds hours from here on


#####CONVERT walk minutes to distance in kilometers per day

Walking_Rate<-1.05/1000 #walking speed of 1.05 m/s, expressed in km per second

Walk_Distance_Per_Min<-Walking_Rate*60 #km per minute (1.05 m/s x 60 s = 63 m = 0.063 km)

df_NEm$Avg_Distance_Walk<-df_NEm$Walk_Min*Walk_Distance_Per_Min

####
Grazing_Walking_Rate<-0.093/1000 # km per second 
Graze_Distance_Per_Min<-Grazing_Walking_Rate*60 #in kilometers/minute
df_NEm$Avg_Distance_Grazed<-df_NEm$Graze_Min*Graze_Distance_Per_Min


Fraction_Distance_Flat<-0.5 #Assume that half the distance is traveled on flat ground
df_NEm$Distance_Slope_Km<-(df_NEm$Avg_Distance_Walk+df_NEm$Avg_Distance_Grazed)*(1-Fraction_Distance_Flat)
df_NEm$Distance_flat_Km<-(df_NEm$Avg_Distance_Walk+df_NEm$Avg_Distance_Grazed)*Fraction_Distance_Flat

####################


#HARD-CODED PARAMETERS FROM TEDESCHI AND FOX 2020 (Page 295, Equation 12.14)
Standing<-df_NEm$Rest_Min #hours spent resting/standing (Rest_Min was converted to hours above)
Position_Change<-6 #number of position changes per day (e.g., resting to walking)
Average_slope <- 3.6 #average terrain slope (percent)



Average_slope_fraction = Average_slope/100
inclination<-atan(0.03663)*(180.0/pi) #inclination angle in degrees, from a hard-coded slope fraction of 0.03663
df_NEm$km_ascending<- (df_NEm$Distance_Slope_Km-df_NEm$Distance_Slope_Km*cos(inclination*pi/180))/sin(inclination*pi/180)
FBW<-340 #kg 750lb steer Full Body Weight
#calculate net energy required for activity (Mcal/day) for each day
df_NEm$NEmr_act<- (0.1*df_NEm$Rest_Min+0.062*Position_Change+0.621*df_NEm$Distance_flat_Km+6.69*df_NEm$km_ascending)*FBW/1000
#plot daily
ggplot(df_NEm,aes(x=Date,y=NEmr_act))+
  geom_bar(stat="identity", fill="steelblue")+
  theme_minimal()+
  xlab("\nDate")+
  ylab('Mcal/day\n ')+
  ggtitle("Daily Net Energy Required for Physical Activity")
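As a final sanity check, the activity-energy equation used above can be evaluated by hand for a single day. The input values below are illustrative assumptions, not output from our model:

```r
#Hand-worked example of the activity equation for one hypothetical day
FBW <- 340              #kg full body weight
rest_hours <- 12        #hypothetical hours spent resting/standing
position_changes <- 6   #position changes per day
flat_km <- 0.6          #hypothetical km traveled on flat ground
ascend_km <- 0.05       #hypothetical km of vertical ascent
NEmr_act <- (0.1 * rest_hours + 0.062 * position_changes +
             0.621 * flat_km + 6.69 * ascend_km) * FBW / 1000
round(NEmr_act, 2)      #Mcal/day
```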